1. Gene Expression Matrices

Level Quantifier Metrics filePath
Transcript RSEM Raw counts /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.RSEM_Count.txt
Transcript RSEM TPM /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.RSEM_TPM.txt
Transcript RSEM FPKM /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.RSEM_FPKM.txt
Transcript Salmon Raw counts /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.Salmon_Count.txt
Transcript Salmon TPM /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_transcriptLevel.Salmon_TPM.txt
Gene RSEM Raw counts /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.RSEM_Count.txt
Gene RSEM TPM /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.RSEM_TPM.txt
Gene RSEM FPKM /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.RSEM_FPKM.txt
Gene Salmon Raw counts /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.Salmon_Count.txt
Gene Salmon TPM /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/01_expressMatrix_geneLevel.Salmon_TPM.txt

2. Alignment statistics

Alignment statistics help answer three questions: 1) Have you sequenced enough reads? Usually, >50M mapped reads and >25M uniquely-mapped reads are expected. 2) Have your reads been over-trimmed? Usually, a Percentage of Filtered Reads >90% is expected. Over-trimming can happen when the insert size is too small or when the wrong Phred score encoding was selected. 3) Are your samples contaminated? Usually, a Percentage of Mapped Reads >75% is expected. The most common cause of a low mapping rate is contamination with either DNA or rRNA.

Sample Name Number of Total Reads Number of Filtered Reads Percentage of Filtered Reads Number of Mapped Reads Percentage of Mapped Reads Number of Uniquely-mapped Reads Percentage of Uniquely-mapped Reads
sample1 38820746 38327088 98.73% 31672982 82.64% 29326030 92.59%
sample5 27712150 27534450 99.36% 23095284 83.88% 21399082 92.66%
sample2 16005450 16004460 99.99% 1125972 7.04% 800744 71.12%
* Percentage of Uniquely-mapped Reads = Number of Uniquely-mapped Reads / Number of Mapped Reads
* Percentage of Mapped Reads = Number of Mapped Reads / Number of Filtered Reads
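The checks described in this section can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: the function and constant names are hypothetical, the thresholds are the ones quoted in the text, and the formula for Percentage of Filtered Reads (filtered / total) is inferred from the table values rather than stated in the footnotes.

```python
# Thresholds quoted in the section text; the constant names are illustrative.
MIN_MAPPED_READS = 50_000_000
MIN_UNIQUE_READS = 25_000_000
MIN_PCT_FILTERED = 90.0
MIN_PCT_MAPPED = 75.0


def alignment_qc(total, filtered, mapped, unique):
    """Compute the three percentages for one sample and collect QC warnings."""
    # Assumption: filtered / total, inferred from the table (not in the footnotes).
    pct_filtered = 100.0 * filtered / total
    # These two definitions come from the table footnotes.
    pct_mapped = 100.0 * mapped / filtered
    pct_unique = 100.0 * unique / mapped

    warnings = []
    if mapped <= MIN_MAPPED_READS or unique <= MIN_UNIQUE_READS:
        warnings.append("sequencing depth below expectation")
    if pct_filtered <= MIN_PCT_FILTERED:
        warnings.append("possible over-trimming")
    if pct_mapped <= MIN_PCT_MAPPED:
        warnings.append("low mapping rate, possible contamination")

    return {
        "pct_filtered": pct_filtered,
        "pct_mapped": pct_mapped,
        "pct_unique": pct_unique,
        "warnings": warnings,
    }


# Values for sample2 taken from the table above.
sample2 = alignment_qc(16005450, 16004460, 1125972, 800744)
print(round(sample2["pct_mapped"], 2), sample2["warnings"])
```

With the sample2 counts, the mapping rate falls far below the 75% expectation, so the sketch reports the contamination warning in addition to the depth warning.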

3. Quantification statistics

Quantification statistics help assess how accurate the quantification results are: 1) How many transcripts/genes were identified confidently? Usually, >65K transcripts and/or >15K genes are expected. 2) How consistent is the quantification between RSEM and Salmon? Usually, a correlation coefficient >0.85 is expected.

3.1 Transcript level

Sample Name Identified by RSEM Identified by Salmon Identified by Both Coef_Pearson Pval_Pearson rho_Spearman Pval_Spearman
sample5 85686 80064 73906 0.985 0 0.9712 0
sample1 91688 87064 80009 0.979 0 0.9698 0
sample2 17121 22611 11230 0.2311 4.78723092812193e-136 0.8404 0
* Only co-identified transcripts/genes were used in correlation analysis.

3.2 Gene level

Sample Name Identified by RSEM Identified by Salmon Identified by Both Coef_Pearson Pval_Pearson rho_Spearman Pval_Spearman
sample5 21237 21422 20535 0.9856 0 0.9883 0
sample1 22114 22431 21364 0.9798 0 0.9877 0
sample2 9735 12847 9233 0.2505 4.22481223109004e-132 0.8377 0
* Only co-identified transcripts/genes were used in correlation analysis.
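As a rough illustration of how the columns in these tables relate, the sketch below restricts two quantification vectors to co-identified features and computes Pearson and Spearman coefficients in plain Python. Several points are assumptions, not statements about the pipeline: "identified" is taken to mean an expression value above a cutoff (0 here), the correlation is computed on untransformed values, and the feature IDs and numbers are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def compare_quantifiers(rsem, salmon, cutoff=0.0):
    """Correlate two {feature_id: expression} dicts on co-identified features.

    Assumption: a feature counts as identified when its value exceeds `cutoff`.
    """
    shared = sorted(
        f for f in rsem if rsem[f] > cutoff and salmon.get(f, 0.0) > cutoff
    )
    x = [rsem[f] for f in shared]
    y = [salmon[f] for f in shared]
    return len(shared), pearson(x, y), spearman(x, y)


# Hypothetical toy values, not taken from the report.
rsem = {"t1": 1.0, "t2": 2.0, "t3": 4.0, "t4": 0.0}
salmon = {"t1": 2.0, "t2": 4.0, "t3": 8.0, "t5": 3.0}
n_both, r, rho = compare_quantifiers(rsem, salmon)
print(n_both, r, rho)
```

A pattern like sample2, where the Spearman coefficient stays high while the Pearson coefficient is low, indicates that the two quantifiers mostly agree on ranking but disagree strongly on magnitude for some features.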

4. Biotype distribution

Biotype distribution shows the composition of the identified transcripts and genes by biotype. Since different biotypes of transcripts/genes vary widely in length, GC content and other properties, the biotype distribution can serve as a measure of quantification quality. Empirically, 1) for total RNA libraries, protein_coding transcripts and genes should account for >40% and >50%, respectively; 2) for mRNA libraries, protein_coding transcripts and genes should account for >80%. For more details about the biotypes: https://www.gencodegenes.org/pages/biotypes.html.

4.1 Transcript level

sampleName protein_coding retained_intron lncRNA protein_coding_CDS_not_defined nonsense_mediated_decay Others processed_pseudogene
sample2 55.57% 14.65% 2.46% 12.31% 10.17% 1.35% 3.49%
sample5 44.29% 20.2% 15.88% 9.17% 8.6% 1.87% NA
sample1 43.19% 20.26% 16.79% 9.15% 8.71% 1.9% NA
* The biotypes of <1% were marked as ‘NA’ and merged into ‘Others’.

4.2 Gene level

sampleName protein_coding lncRNA processed_pseudogene TEC Others
sample2 87.68% 3.91% 6.14% NA 2.26%
sample5 64.97% 27.92% 3.51% 1.17% 2.43%
sample1 63.41% 29.16% 3.57% 1.23% 2.63%
* The biotypes of <1% were marked as ‘NA’ and merged into ‘Others’.
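The footnote rule (biotypes below 1% are merged into 'Others') can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the percentages are assumed to be computed from per-biotype counts of identified features, and the input numbers are invented.

```python
def summarize_biotypes(counts, min_pct=1.0):
    """Convert per-biotype counts to percentages, merging rare biotypes.

    Biotypes below `min_pct` percent are dropped from the result and their
    share is added to an 'Others' entry, mirroring the table footnote.
    """
    total = sum(counts.values())
    kept = {}
    others = 0.0
    for biotype, n in counts.items():
        pct = 100.0 * n / total
        if pct < min_pct:
            others += pct
        else:
            kept[biotype] = pct
    kept["Others"] = others
    return kept


# Hypothetical counts, not taken from the report.
counts = {"protein_coding": 900, "lncRNA": 95, "TEC": 5}
print(summarize_biotypes(counts))
```

In the tables above, a merged biotype still appears as a column because it passed the 1% cutoff in at least one other sample; the sketch handles one sample at a time and simply omits it.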

5. Genebody coverage statistics

Gene body coverage statistics summarize the RNA-Seq read coverage over the gene body. 1) Mean of Coverage is the average coverage across all 100 bins of the gene body. Usually, >0.7 is expected. 2) Coefficient of Skewness is a measure of the asymmetry of the distribution of gene body coverage. Fisher’s moment coefficient of skewness is calculated by default. The closer it is to 0, the more symmetric the coverage. For more details: https://en.wikipedia.org/wiki/Skewness.

sampleName meanCoverage coefSkewness
sample5 0.8555260 0.0709431
sample1 0.8449504 0.0756538
sample2 0.5527551 0.3571358
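For readers who want to see how the two metrics are defined, here is a small sketch. It is not the pipeline's implementation, and two details are assumptions: coverage is normalized by dividing each bin by the maximum bin before averaging, and the report does not say whether skewness is taken over the bin values themselves or over bin positions weighted by coverage, so the helper supports both through an optional `weights` argument. Fisher's moment coefficient is the third central moment divided by the second central moment raised to the power 3/2.

```python
def mean_coverage(bins):
    """Mean of per-bin coverage after scaling by the maximum bin (assumed)."""
    peak = max(bins)
    return sum(b / peak for b in bins) / len(bins)


def fisher_skewness(values, weights=None):
    """Fisher's moment coefficient of skewness: m3 / m2 ** 1.5.

    With `weights`, the moments are weighted, e.g. values = bin positions
    and weights = coverage per bin.
    """
    if weights is None:
        weights = [1.0] * len(values)
    w = sum(weights)
    mean = sum(v * wt for v, wt in zip(values, weights)) / w
    m2 = sum(wt * (v - mean) ** 2 for v, wt in zip(values, weights)) / w
    m3 = sum(wt * (v - mean) ** 3 for v, wt in zip(values, weights)) / w
    return m3 / m2 ** 1.5


# Hypothetical 4-bin profile (the report uses 100 bins).
coverage = [5, 10, 10, 5]
positions = [1, 2, 3, 4]
print(mean_coverage(coverage))
print(fisher_skewness(positions, weights=coverage))
```

A symmetric profile such as the toy example gives a skewness of 0, while a profile with more coverage toward one end of the gene body shifts the coefficient away from 0.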